Introduction
The dataset used is the Enron email dataset, a collection of public-domain emails from the Enron Corporation. The emails have been manually classified as spam and ham (non-spam). The objective is to build a supervised classification pipeline that labels emails as spam or non-spam from the training data. We will compare various supervised classification models, compute the accuracy of each, and thereby select the most accurate model.
Various Steps involved:
Importing the required libraries
import numpy as np
import pandas as pd
import re
import nltk
import plotly.express as plt   # note: non-standard alias (conventionally px)
import matplotlib.pyplot as py # note: non-standard alias (conventionally plt)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import plot_roc_curve  # removed in scikit-learn 1.2; use RocCurveDisplay.from_estimator on newer versions
from sklearn.model_selection import cross_val_score
nltk.download('stopwords')
nltk.download('punkt')  # required later by word_tokenize
# installing plotly package
#!pip install plotly
Loading Dataset
enron_data = load_files("C:\\Users\\dalal\\Desktop\\CIT Modules + Lectures + Materials\\- Sem 2\\Applied Machine Learning - COMP9060_26651\\Project 1\\enron1")
print ("Total number of Emails loaded: %d emails" % len(enron_data.filenames))
print("Categories Loaded: ",enron_data.target_names)
With the help of the 'load_files' function from the sklearn library [1], the text files containing individual emails, organised under the subfolder names 'ham' and 'spam', are loaded. The 'load_files' function returns the email contents in 'data' and the numeric class labels in 'target'.
There are 5172 emails in total across both folders, and the categories are 'ham' and 'spam'.
# Creating two lists to store the emails and category of the email (ham or spam)
email_message, email_category = [], []
email_message = np.append(email_message, enron_data.data)
email_category = np.append(email_category, enron_data.target)
email_dict = {'Email Body' : email_message, 'Email Category' : email_category}  # avoid shadowing the built-in 'dict'
email_dataframe = pd.DataFrame(email_dict)
email_dataframe["Email Body"] = email_dataframe["Email Body"].astype(str)
email_dataframe["Email Category"] = email_dataframe["Email Category"].astype(str)
Two lists are created with the numpy library: 'email_message' to store the emails and 'email_category' to store the category of each email (ham or spam). The 'target' variable holds the category as a float, where 0.0 is ham and 1.0 is spam.
The 'email_dataframe' DataFrame is created by adding these lists to a dictionary and passing it to pandas.
Both Series in the DataFrame are converted to the string data type in order to perform string operations on them.
# Changing the Email Category value 0.0 to ham and 1.0 to spam
email_dataframe["Email Category"] = email_dataframe["Email Category"].replace({"0.0":"ham","1.0":"spam"})
print(email_dataframe.describe())
email_dataframe.head()
The 'Email Category' values 0.0 and 1.0 are replaced with 'ham' and 'spam' for clearer display.
From the 5172 emails, we notice that only 4994 are unique, so the dataset contains duplicated emails.
Cleaning the Data
Dropping duplicates and checking whether NaNs are present
# Removing the duplicate emails
email_dataframe = email_dataframe.drop_duplicates()
email_dataframe.shape
After removing the duplicates, 4994 unique emails remain.
# Checking if any null values or NAN or NA values are present that needs to be removed
email_dataframe["Email Body"].isnull().sum()
There are no null, NaN, or NA values present that need to be removed.
Email Cleaning
# Removing special characters like !, @, ., %, etc. and keeping only alphabets
clean_email_no_special = []
for email in email_dataframe["Email Body"]:
    string = email
    # converting the words to lower case
    string = string.lower()
    # removing the leading b' left over from the bytes representation
    string = re.sub(r'^b+', '', string)
    # 'subject' (present in every email) and special characters to be removed
    special_char = ['subject','\n','\\n','\t','\\r','\r','\'','"',':',';','.','/','!','#','$','%','{','}','(',')','?','&',']','[','-','_']
    for char in special_char:
        string = string.replace(char, ' ')
    string = re.sub("[^A-Za-z0-9]+", ' ', str(string))
    # removing numbers from the string
    pattern = '[0-9]'
    string = re.sub(pattern, '', string)
    # removing single characters in the string
    string = ' '.join([w for w in string.split() if len(w) > 1])
    clean_email_no_special.append(string)
Reading each email from the DataFrame, we convert every word to lower case.
Every email starts with the character 'b' (a leftover of the bytes representation), which is removed.
A list of special characters such as '\n', '\t', '\r', '\'', '"', ':', ';', '.', '/', '!', '#', '$', etc. is created and those characters are removed.
The word 'subject' is also removed, as it appears in every email.
Numbers and single characters are then removed from the string.
The 'clean_email_no_special' list is obtained after cleaning.
An example of an email before and after cleaning:
Before Cleaning:
print([email_dataframe["Email Body"][10]])
After Cleaning:
print(clean_email_no_special[10])
Removing the stop words
stop_words = set(stopwords.words('english'))
clean_email_no_stopwords = []
for i in range(len(clean_email_no_special)):
    word_tokens = word_tokenize(clean_email_no_special[i])
    words = [w for w in word_tokens if w not in stop_words]
    clean_email_no_stopwords.append(words)
In order to remove the stop words (such as “the”, “a”, “an”, “in”), we import those words from the 'corpus' module of the 'nltk' library. To remove stop words from an email, we first split the text into words and then drop each word that exists in NLTK's list of stop words. The 'word_tokenize' function is used for this.
An example of an email before and after removing the stop words:
Before removing the stop words:
print(clean_email_no_special[1])
After removing the stop words:
print(clean_email_no_stopwords[1])
clean_email = []
for lis in clean_email_no_stopwords:
    clean_email.append(' '.join(word for word in lis))
clean_email[1]
Joining the tokenized words back to form a string for an email. 'clean_email' contains the list of emails after cleaning and removing the stop words.
email_dataframe["Email Category"] = email_dataframe["Email Category"].replace({"ham":0,"spam": 1})
Splitting the Dataset into Train and Test Data
# Splits the data into train and test dataset in a ratio of 70:30
features = clean_email
classes = email_dataframe["Email Category"]
email_train, email_test, class_train, class_test = train_test_split(features, classes, train_size=0.7, test_size=0.3, shuffle=True)
print("Emails in the Training set: %d emails" % len(email_train))
print("Emails in the Test set: %d emails" % len(email_test))
Splitting the data into training and test sets in a ratio of 70:30 gives 3495 emails in the training set and 1499 emails in the test set.
# Splits the training data into train and validation sets in a ratio of 80:20 in order to carry out cross-validation on the validation set
email_train, email_val, class_train, class_val = train_test_split(email_train, class_train, train_size=0.8, test_size=0.2, shuffle=True)
print("Emails in the Training set: %d emails" % len(email_train))
print("Emails in the Validation set: %d emails" % len(email_val))
Splitting the training data further into training and validation sets in a ratio of 80:20, in order to carry out cross-validation on the validation set, leaves 2796 emails in the training set and 699 emails in the validation set.
class_train.value_counts()
There are 1968 emails for 'ham' category and 828 emails for the 'spam' category in the Training set.
# barplot of frequency of ham and spam in train data
train = pd.DataFrame(data=class_train)
train["Email Category"] = train["Email Category"].replace({0:"ham",1:"spam"})
barplot_train = train['Email Category'].value_counts().plot(kind='bar',
title="Counts of ham and spam emails in Training data")
barplot_train.set_xlabel("Email Category")
barplot_train.set_ylabel("Count")
py.show()
Above figure shows the visual presentation of number of emails present in the Training set for ham and spam categories. There are 1968 emails for 'ham' category and 828 emails for the 'spam' category.
class_test.value_counts()
There are 1074 emails for 'ham' category and 425 emails for the 'spam' category in the Test set.
# barplot of frequency of ham and spam in test data
test = pd.DataFrame(data=class_test)
test["Email Category"] = test["Email Category"].replace({0:"ham",1:"spam"})
barplot_test = test['Email Category'].value_counts().plot(kind='bar',
title="Counts of ham and spam emails in Test data",
color="green")
barplot_test.set_xlabel("Email Category")
barplot_test.set_ylabel("Count")
py.show()
Above figure shows the visual presentation of number of emails present in the Test set for ham and spam categories. There are 1074 emails for 'ham' category and 425 emails for the 'spam' category.
Exploratory Data Analysis on the Training Set
#creating a dictionary for the Training Set
train_set_dict = {'Email Body' : email_train,'Email Category' : class_train}
train_set = pd.DataFrame(train_set_dict)
train_set["Email Category"] = train_set["Email Category"].replace({0:'ham',1: 'spam'})
train_set.head()
Top 20 Most Frequently Used Words in Ham Emails
Creating a subset with only ham type values
# creating a subset with only ham type values
ham = train_set.loc[(train_set["Email Category"] == 'ham')]
ham.head()
# concatenating all rows of the data set into one string
combined_ham_emails = ham["Email Body"].str.cat(sep=' ')
combined_ham_emails = combined_ham_emails.replace(',',' ')
# using word_tokenize to count the frequency of each word
words = nltk.tokenize.word_tokenize(combined_ham_emails)
word_dist = nltk.FreqDist(words)
# storing the data into dataframe
ham_frequent_words = pd.DataFrame(word_dist.most_common(20),
columns=['Word', 'Frequency'])
print("Top 20 Most Frequently Used Words in Ham Emails:")
print(ham_frequent_words)
To find the most frequent words in the ham emails, we first combine all the emails into one string. Then, using 'word_tokenize' and 'FreqDist' to count the frequency of each word, we store the top counts in the DataFrame 'ham_frequent_words'.
# Barplot of most frequent words used in ham emails
fig = plt.bar(ham_frequent_words, x='Word', y='Frequency', color='Frequency',
labels={'Word':'Most frequently used words in ham','Frequency':'Frequency of each word'}, height=500)
fig.show(renderer="notebook")
Above figure shows the visual presentation in the bar chart [2] of the 20 most frequent words used in ham emails.
In order to render this dynamic plot in the HTML document, we have used 'fig.show(renderer="notebook")'.
Top 20 Most Frequently Used Words in Spam Emails
Creating a subset with only spam type values
# creating a subset with only spam type values
spam = train_set.loc[(train_set["Email Category"] == 'spam')]
spam.head()
# concatenating all rows of the data set into one string
combined_spam_emails = spam["Email Body"].str.cat(sep=' ')
combined_spam_emails = combined_spam_emails.replace(',',' ')
# using word_tokenize to count the frequency of each word
words = nltk.tokenize.word_tokenize(combined_spam_emails)
word_dist = nltk.FreqDist(words)
# storing the data into dataframe
spam_frequent_words = pd.DataFrame(word_dist.most_common(20),
columns=['Word', 'Frequency'])
print("Top 20 Most Frequently Used Words in Spam Emails:")
print(spam_frequent_words)
# Barplot of most frequent words used in spam emails
fig = plt.bar(spam_frequent_words, x='Word', y='Frequency', color='Frequency',
labels={'Word':'Most frequently used words in spam','Frequency':'Frequency of each word'}, height=500)
fig.show()
Above figure shows the visual presentation in the bar chart of the 20 most frequent words used in spam emails.
Boxplot for comparison of the distribution of email lengths in ham and spam emails
# creating an independent copy of the data set; note that apply(len) gives the
# length of each 'Email Body' in characters, not words
train_set_copy = train_set.copy()
train_set_copy['Word Count'] = train_set_copy['Email Body'].apply(len)
# Describing the Email Categories
print(train_set_copy.groupby('Email Category').describe())
# boxplot of distribution of email Category ham and spam
fig = plt.box(train_set_copy, x="Email Category", y="Word Count",
labels={'Email Category':'Type of email','Word Count':'Email length (characters)'})
fig.show()
The spam emails are typically longer than the ham emails, since the median length is greater for spam: 270 characters for ham versus 387 for spam. The mean values from describe() confirm the same.
The maximum lengths for spam and ham are very close: 20401 characters for ham and 21432 for spam.
Both categories have outliers, including extreme outliers above 20,000 characters in length.
Feature Extraction
vectorizer = TfidfVectorizer(min_df=2, encoding='utf-8', stop_words = 'english', analyzer='word')
message_train_Feature = vectorizer.fit_transform(email_train)
message_test_Feature = vectorizer.transform(email_test)
message_val_Feature = vectorizer.transform(email_val)
The purpose of feature extraction is to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text.
We use 'TfidfVectorizer' from the sklearn library to vectorize our features. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features, which is equivalent to CountVectorizer followed by TfidfTransformer.
We fit the vectorizer on the training set (via fit_transform) and then only transform the test and validation sets, so that the vocabulary is learned from the training data alone.
message_train_Feature.shape
message_test_Feature.shape
message_val_Feature.shape
Transform converts the raw text into the vectorized representation expected by the next stage of the pipeline. The three matrices differ in their number of rows (emails) but share the same number of columns, since the test and validation sets are transformed with the vocabulary learned from the training set.
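The equivalence claimed above (TfidfVectorizer = CountVectorizer followed by TfidfTransformer) can be checked on a tiny toy corpus. This is an illustrative sketch, not part of the project pipeline:

```python
# Sketch: verify that TfidfVectorizer equals CountVectorizer + TfidfTransformer
# on a toy corpus (default parameters on both paths).
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

toy_corpus = ["free money now", "meeting agenda attached", "free offer now"]

# One-step: TfidfVectorizer
tfidf_direct = TfidfVectorizer().fit_transform(toy_corpus)

# Two-step: raw term counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))  # True
```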
Model Selection
We have selected various models, namely Random Forest, K-Nearest Neighbor, Decision Tree, Multinomial Naive Bayes and Support Vector Classification, to predict the outcomes for the validation and test datasets, compare the accuracy scores of all the models, and thereby select the most accurate model.
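The compare-and-select workflow described above can be sketched generically. This example uses a small synthetic dataset (not the real email features) and a subset of the models, purely for illustration:

```python
# Sketch of the model-comparison loop on synthetic data (illustrative only;
# the project uses the TF-IDF email features instead).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = np.abs(X)  # MultinomialNB requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Classification": SVC(),
}
# fit each model on the training split and score it on the held-out split
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```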
Random Forest Model
#Random Forest Model
random_forest = RandomForestClassifier(max_depth=10, n_estimators = 1000, random_state=0)
random_forest.fit(message_train_Feature,class_train)
# calculating the cross validation score for the Validation Set
rf_cv_result = cross_val_score(random_forest,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(rf_cv_result)
rf_cv_acc = round(np.mean(rf_cv_result)*100,2)
print("Accuracy of Random Forest on the Validation Set :",rf_cv_acc,"%")
We apply the 'RandomForestClassifier' from the sklearn library with a maximum tree depth of 10 and 1000 estimators.
Firstly, we calculate the cross validation score for the Validation Set with 5 folds.
Taking the mean of those scores, we get the accuracy of Random Forest model on the Validation Set as 79.26 %.
#Predicting for the Test Dataset
rf_pred = random_forest.predict(message_test_Feature)
print("Classification Report for the Test Data")
print(classification_report(class_test, rf_pred))
rf_test_accuracy = metrics.accuracy_score(rf_pred, class_test)
rf_test_accuracy = round(rf_test_accuracy*100,2)
print("Accuracy of Random Forest Model on the Test Set :",rf_test_accuracy,"%")
The Accuracy of Random Forest model on the Test Set is 82.19 %.
print("CONFUSION MATRIX for Random Forest Model on the Test Data:")
rf_confusion_matrix = confusion_matrix(class_test, rf_pred)
rf_conf_matrix = pd.DataFrame(data = rf_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])
print(rf_conf_matrix)
From the confusion matrix, it can be observed that no ham email has been predicted as spam, while 267 spam emails of the test data have been predicted as ham.
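How to read the 2x2 confusion matrices used throughout: rows are actual classes, columns are predicted classes, so cell [0, 1] counts ham predicted as spam. A small sketch with toy labels (not the project data):

```python
# Sketch: the layout of sklearn's confusion_matrix on toy labels,
# where 0 = ham and 1 = spam.
from sklearn.metrics import confusion_matrix

actual    = [0, 0, 0, 1, 1]
predicted = [0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm.ravel()  # row-major: TN, FP, FN, TP
print(cm)              # [[2 1]
                       #  [1 1]]
print(tn, fp, fn, tp)  # 2 1 1 1
```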
K-Nearest Neighbor Model
#K-Nearest Neighbor Model
knn = KNeighborsClassifier(n_neighbors=3)
knn_model = knn.fit(message_train_Feature,class_train)
# Calculating the cross validation score for the Validation Set
knn_cv_result = cross_val_score(knn_model,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(knn_cv_result)
knn_acc = round(np.mean(knn_cv_result)*100,2)
print("Accuracy of KNN Model on the Validation Set :",knn_acc,"%")
We apply the 'KNeighborsClassifier' from sklearn library with number of neighbors as 3.
Firstly, we calculate the cross validation score for the Validation Set with 5 folds.
Taking the mean of those scores, we get the accuracy of K-Nearest Neighbor Model on the Validation Set as 59.07 %.
#Predicting for the Test Dataset
knn_predict = knn_model.predict(message_test_Feature)
print("Classification Report for the Test Data")
print(classification_report(class_test, knn_predict))
knn_test_accuracy = metrics.accuracy_score(knn_predict, class_test)
knn_test_accuracy = round(knn_test_accuracy*100,2)
print("Accuracy of K-Nearest Neighbor Model on the Test Set :",knn_test_accuracy,"%")
The Accuracy of K-Nearest Neighbor Model on the Test Set is 96.66 %.
print("CONFUSION MATRIX for K-Nearest Neighbor Model on the Test Data:")
knn_confusion_matrix = confusion_matrix(class_test, knn_predict)
knn_conf_matrix = pd.DataFrame(data = knn_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])
print(knn_conf_matrix)
From the confusion matrix, it can be observed that 13 ham emails have been predicted as spam and 37 spam emails of test data have been predicted as ham.
Decision Tree Model
# Applying Decision Tree Model
dec_tree = DecisionTreeClassifier()
dt_model = dec_tree.fit(message_train_Feature,class_train)
# Calculating the cross validation score for the Validation Set
dt_cv_result =cross_val_score(dec_tree,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(dt_cv_result)
dt_acc = round(np.mean(dt_cv_result)*100,2)
print("Accuracy of Decision Tree Model on the Validation Set :",dt_acc,"%")
We apply the 'DecisionTreeClassifier' from sklearn library.
Firstly, we calculate the cross validation score for the Validation Set with 5 folds.
Taking the mean of those scores, we get the accuracy of Decision Tree Model on the Validation Set as 89.7 %.
#Predicting for the Test Dataset
dt_pred = dt_model.predict(message_test_Feature)
print("Classification Report for the Test Data")
print(classification_report(class_test, dt_pred))
dt_test_accuracy = metrics.accuracy_score(dt_pred, class_test)
dt_test_accuracy = round(dt_test_accuracy*100,2)
print("Accuracy of Decision Tree Model on the Test Set :",dt_test_accuracy,"%")
The Accuracy of Decision Tree Model on the Test Set is 94.06 %.
print("CONFUSION MATRIX for Decision Tree Model on the Test Data:")
dt_confusion_matrix = confusion_matrix(class_test, dt_pred)
dt_conf_matrix = pd.DataFrame(data = dt_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])
print(dt_conf_matrix)
From the confusion matrix, it can be observed that 61 ham emails have been predicted as spam and 28 spam emails of test data have been predicted as ham.
Multinomial Naive Bayes Model
# Multinomial Naive Bayes Model
mnbc = MultinomialNB()
mnbc.fit(message_train_Feature,class_train)
# Calculating the cross validation score for the Validation Set
mnb_cv_result =cross_val_score(mnbc,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(mnb_cv_result)
mnb_acc = round(np.mean(mnb_cv_result)*100,2)
print("Accuracy of Multinomial Naive Bayes Model on the Validation Set :",mnb_acc,"%")
We apply the 'MultinomialNB' from sklearn library.
Firstly, we calculate the cross validation score for the Validation Set with 5 folds.
Taking the mean of those scores, we get the accuracy of Multinomial Naive Bayes Model on the Validation Set as 84.69 %.
#Predicting for the Test Dataset
MNBC_preds = mnbc.predict(message_test_Feature)
print("Classification Report for the Test Data")
print(classification_report(class_test, MNBC_preds))
mnb_test_accuracy = metrics.accuracy_score(MNBC_preds, class_test)
mnb_test_accuracy = round(mnb_test_accuracy*100,2)
print("Accuracy of Multinomial Naive Bayes Model on the Test Set :",mnb_test_accuracy,"%")
The Accuracy of Multinomial Naive Bayes Model on the Test Set is 96.26 %.
print("CONFUSION MATRIX for Multinomial Naive Bayes Model on the Test Data:")
mnb_confusion_matrix = confusion_matrix(class_test, MNBC_preds)
mnb_conf_matrix = pd.DataFrame(data = mnb_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])
print(mnb_conf_matrix)
From the confusion matrix, it can be observed that 5 ham emails have been predicted as spam and 51 spam emails of test data have been predicted as ham.
Support Vector Classification
# Support Vector Classification Model
svc = SVC()
svc_model = svc.fit(message_train_Feature,class_train)
# Calculating the cross validation score for the Validation Set
svc_cv_result =cross_val_score(svc,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(svc_cv_result)
svc_acc = round(np.mean(svc_cv_result)*100,2)
print("Accuracy of Support Vector Classification Model on the Validation Set :",svc_acc,"%")
We apply the 'SVC' from sklearn library.
Firstly, we calculate the cross validation score for the Validation Set with 5 folds.
Taking the mean of those scores, we get the accuracy of Support Vector Classification Model on the Validation Set as 92.13 %.
#Predicting for the Test Dataset
svc_pred=svc_model.predict(message_test_Feature)
print("Classification Report for the Test Data")
print(classification_report(class_test, svc_pred))
svc_test_accuracy = metrics.accuracy_score(svc_pred, class_test)
svc_test_accuracy = round(svc_test_accuracy*100,2)
print("Accuracy of Support Vector Classification Model on the Test Set :",svc_test_accuracy,"%")
The Accuracy of Support Vector Classification Model on the Test Set is 99.53 %.
print("CONFUSION MATRIX for Support Vector Classification Model on the Test Data:")
svc_confusion_matrix = confusion_matrix(class_test, svc_pred)
svc_conf_matrix = pd.DataFrame(data = svc_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])
print(svc_conf_matrix)
From the confusion matrix, it can be observed that 7 ham emails have been predicted as spam and no spam emails of test data have been predicted as ham.
For the Support Vector Classification model, the ROC (Receiver Operating Characteristic) curve is plotted: the True Positive Rate (sensitivity, the proportion of actual positives that are correctly classified) against the False Positive Rate (1 − specificity, where specificity is the proportion of actual negatives that are correctly classified).
svc_disp = plot_roc_curve(svc_model, message_test_Feature, class_test)  # on scikit-learn >= 1.2 use RocCurveDisplay.from_estimator
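The points on the ROC curve can also be computed directly with 'roc_curve'. A sketch with hypothetical decision scores (not the project's SVC output):

```python
# Sketch: roc_curve returns the false positive rate and true positive rate
# at each score threshold; auc integrates the curve. Toy data only.
from sklearn.metrics import roc_curve, auc

y_true   = [0, 0, 0, 1, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]  # hypothetical decision scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc(fpr, tpr))  # 1.0 here: the scores perfectly separate the classes
```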
Comparison of accuracy of all the classifier models
Creating a Data Frame and storing the Accuracy scores of the Test Set of different models in order to compare.
# storing the Test Set accuracy scores of the different models in a data frame
# (DataFrame.append was removed in pandas 2.0, so the frame is built directly)
model_accuracy = pd.DataFrame({
    "Model": ["Random Forest", "K-Nearest Neighbor", "Decision Tree",
              "Multinomial Naive Bayes", "Support Vector Classification"],
    "Accuracy": [rf_test_accuracy, knn_test_accuracy, dt_test_accuracy,
                 mnb_test_accuracy, svc_test_accuracy],
})
model_accuracy
Bar Plot comparison of accuracy of all the classifier models
# Bar plot comparison of accuracy of all the classifier models
accuracy_comparison = plt.bar(model_accuracy, x='Model', y='Accuracy', color='Accuracy',
labels={'Model':'Model Name','Accuracy':'Accuracy of the Model'}, height=400)
accuracy_comparison.show()
Conclusion
Among the selected models (Random Forest, K-Nearest Neighbor, Decision Tree, Multinomial Naive Bayes and Support Vector Classification), the Support Vector Classification model fits this dataset best for classifying emails as ham or spam, with a model accuracy of 99.53 %, the highest of all the compared models.